New functions and libraries:
plotly libraryFirst load the data of all state-level polls for the 2020 presidential election.
# setting message=FALSE hides all the tidyverse loading messages
library(tidyverse)
Pres2020.StatePolls <- readRDS(file="Pres2020.StatePolls.Rds")
glimpse(Pres2020.StatePolls)
## Rows: 1,545
## Columns: 19
## $ StartDate <date> 2020-03-21, 2020-03-24, 2020-03-24, 2020-03-28, 2020-0…
## $ EndDate <date> 2020-03-30, 2020-04-03, 2020-03-29, 2020-03-29, 2020-0…
## $ DaysinField <dbl> 10, 11, 6, 2, 3, 5, 2, 2, 7, 3, 3, 3, 2, 2, 3, 4, 10, 1…
## $ MoE <dbl> 2.8, 3.0, 4.2, NA, 4.0, 1.7, 3.0, 3.1, 4.1, 4.4, NA, NA…
## $ Mode <chr> "Phone/Online", "Phone/Online", "Live phone - RDD", "Li…
## $ SampleSize <dbl> 1331, 1000, 813, 962, 602, 3244, 1035, 1019, 583, 500, …
## $ Biden <dbl> 41, 47, 48, 67, 46, 46, 46, 48, 52, 42, 48, 50, 52, 38,…
## $ Trump <dbl> 46, 34, 45, 29, 46, 40, 48, 45, 39, 49, 47, 41, 43, 49,…
## $ Winner <chr> "Rep", "Dem", "Dem", "Dem", "Dem", "Rep", "Dem", "Dem",…
## $ poll.predicted <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Funded <chr> "UtahPolicy.com & KUTV 2News", "Sacred Heart University…
## $ Conducted <chr> "Y2 Analytics", "GreatBlue Research", "LHK Partners Inc…
## $ margin <dbl> -5, 13, 3, 38, 0, 6, -2, 3, 13, -7, 1, 9, 9, -11, 6, -1…
## $ DaysToED <drtn> 218 days, 214 days, 219 days, 219 days, 216 days, 213 …
## $ StateName <chr> "Utah", "Connecticut", "Wisconsin", "California", "Mich…
## $ EV <int> 6, 7, 10, 55, 16, 29, 16, 16, 12, 15, 10, 16, 11, 6, 16…
## $ State <chr> "UT", "CT", "WI", "CA", "MI", "FL", "GA", "MI", "WA", "…
## $ BidenCertVote <dbl> 38, 59, 49, 64, 51, 48, 50, 51, 58, 49, 49, 51, 49, 41,…
## $ TrumpCertVote <dbl> 58, 39, 49, 34, 48, 51, 49, 48, 39, 50, 49, 48, 49, 58,…
Variables of potential interest include:
Suppose that I give you 10 polls from a state.
Load in the data and create the some mutations to create new variables.
Pres2020.StatePolls <- Pres2020.StatePolls %>%
mutate(BidenNorm = Biden/(Biden+Trump),
TrumpNorm = 1-BidenNorm,
Biden = Biden/100,
Trump=Trump/100)
How can you use them to create a probability? Discuss! (I can think of 3 ways.)
How does this vary across states? The joys of
group_by(). Note that group_by() defines what
happens for all subsequent code in that code chunk. So here we are going
to calculate the mean separately for each state.
stateprobs <- Pres2020.StatePolls %>%
group_by(StateName) %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin2 = mean(Biden),
BidenProbWin3 = mean(BidenNorm))
stateprobs
## # A tibble: 50 × 4
## StateName BidenProbWin1 BidenProbWin2 BidenProbWin3
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 0 0.389 0.407
## 2 Alaska 0 0.442 0.466
## 3 Arizona 0.840 0.484 0.519
## 4 Arkansas 0 0.381 0.395
## 5 California 1 0.618 0.661
## 6 Colorado 1 0.534 0.571
## 7 Connecticut 1 0.584 0.631
## 8 Delaware 1 0.603 0.627
## 9 Florida 0.798 0.486 0.517
## 10 Georgia 0.548 0.474 0.504
## # … with 40 more rows
Clearly they differ, so let’s visualize to try to understand what is
going on. Install the library plotly
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
gg <- stateprobs %>%
ggplot(aes(x=BidenProbWin2, y=BidenProbWin3,text=paste(StateName))) +
geom_point() +
geom_abline(intercept=0,slope=1) +
labs(x= "Probability as % Support",
y = "Probability as Two-Party % Support",
title = "Comparing Probability of Winning Measures")
ggplotly(gg,tooltip = "text")
So removing the undecided and making the probabilities for Biden and Trump sum to 100% is consequential.
What about if we compare these measures to the fration of polls with a given winner? After all, it seems implausible that the Biden would ever lose California or Trump would ever lose Tennessee.
library(plotly)
gg <- stateprobs %>%
ggplot(aes(x=BidenProbWin2, y=BidenProbWin1,text=paste(StateName))) +
geom_point() +
geom_abline(intercept=0,slope=1) +
labs(x= "Probability as % Support",
y = "Probability as % Polls Winning",
title = "Comparing Probability of Winning Measures")
ggplotly(gg,tooltip = "text")
So what do you think? Exactly the same data, but just different impications depending on how you choose to measure the probability of winning a state. Data sciene is as much about argument and reasoning as it is about coding. How we measure a concept is often critical to the conclusions that we get.
But we want to combine these probabilities with the Electoral College votes in each state. Not every state has the same amount of Electoral College votes – it is typically given by the number of Senators (2) plus the number of representatives (at least 1) so we need to account for this if we want to make a projection about who is going to win the Electoral College.
PA.dat <- Pres2020.StatePolls %>%
filter(State == "PA")
PA.dat %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin2 = mean(Biden),
BidenProbWin3 = mean(BidenNorm))
## # A tibble: 1 × 3
## BidenProbWin1 BidenProbWin2 BidenProbWin3
## <dbl> <dbl> <dbl>
## 1 0.916 0.499 0.529
EV for
Biden?EV that
we want to use to compute the expected number of electoral college
votes. But recall that when we summarize we change the
tibble to be the output of the function. So how do we keep the number of
Electoral College votes for a future mutation?PA.dat %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin2 = mean(Biden),
BidenProbWin3 = mean(BidenNorm),
EV = mean(EV)) %>%
mutate(BidenEV1 = BidenProbWin1*EV,
BidenEV2 = BidenProbWin2*EV,
BidenEV3 = BidenProbWin3*EV)
## # A tibble: 1 × 7
## BidenProbWin1 BidenProbWin2 BidenProbWin3 EV BidenEV1 BidenEV2 BidenEV3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.916 0.499 0.529 20 18.3 9.98 10.6
Note that we are calculation the Expected Value of the Electoral College votes using: Probability that Biden wins state i X Electoral College Votes in State i. This will allocate fractions of Electoral College votes even though the actual election is winner-take all. This is OK because the fractions reflect the probability that an alternative outcome occurs.
Quick Exercise How can we get compute the expected number of Electoral College votes for Trump in each measure? NOTE: There are at least 2 ways to do this because this is a 2 candidate race
# INSERT CODE HERE
EV-BidenEV, or compute TrumpProbWinPres2020.StatePolls %>%
group_by(StateName) %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin3 = mean(BidenNorm),
EV = mean(EV),
State = first(State)) %>%
mutate(State = State,
BidenECVPredicted1 = EV*BidenProbWin1,
TrumpECVPredicted1 = EV- BidenECVPredicted1,
BidenECVPredicted3 = EV*BidenProbWin3,
TrumpECVPredicted3 = EV- BidenECVPredicted3) %>%
summarize(BidenECVPredicted1=sum(BidenECVPredicted1),
BidenECVPredicted3=sum(BidenECVPredicted3),
TrumpECVPredicted1=sum(TrumpECVPredicted1),
TrumpECVPredicted3=sum(TrumpECVPredicted3),)
## # A tibble: 1 × 4
## BidenECVPredicted1 BidenECVPredicted3 TrumpECVPredicted1 TrumpECVPredicted3
## <dbl> <dbl> <dbl> <dbl>
## 1 345. 289. 190. 246.
Pres2020.StatePolls %>%
group_by(StateName) %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin3 = mean(BidenNorm),
EV = mean(EV)) %>%
mutate(BidenECVPredicted1 = EV*BidenProbWin1,
TrumpECVPredicted1 = EV- BidenECVPredicted1,
BidenECVPredicted3 = EV*BidenProbWin3,
TrumpECVPredicted3 = EV- BidenECVPredicted3) %>%
summarize(BidenECV1 = sum(BidenECVPredicted1),
TrumpECV1 = sum(TrumpECVPredicted1),
BidenECV3 = sum(BidenECVPredicted3),
TrumpECV3 = sum(TrumpECVPredicted3))
## # A tibble: 1 × 4
## BidenECV1 TrumpECV1 BidenECV3 TrumpECV3
## <dbl> <dbl> <dbl> <dbl>
## 1 345. 190. 289. 246.
Quick Exercise Could also do this for just polls conducted in the last 7 days. How?
# INSERT CODE HERE
THINKING: What about states that do not have any polls? What should we do about them? Is there a reason why they might not have a poll? Is that useful information? Questions like this become more relevant when we start to restrict the sample.
Here are the number of polls done in each state in the last 3 days. Note that when we use fewer days our measure based on the percentage of polls won may be more affected?
Pres2020.StatePolls %>%
filter(DaysToED < 3) %>%
count(State) %>%
ggplot(aes(x=n)) +
geom_bar() +
scale_x_continuous(breaks=seq(0,15,by=1)) +
labs(x="Number of Polls in a State",
y="Number of States",
title="Number of Polls in States \n in the Last 3 Days of 2020")
PA.dat <- Pres2020.StatePolls %>%
filter(State == "PA")
Write a loop that, for each iteration….
sample_nBut before you start, how does this differ from what we did previously with the PA data? What do we need to add to the code?
ProbBidenWin.PA <- NULL
NSamples <- 1000
for(i in 1:NSamples){
ProbBidenWin.PA <- PA.dat %>%
sample_n(nrow(PA.dat),replace = TRUE) %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin2 = mean(Biden),
BidenProbWin3 = mean(BidenNorm)) %>%
bind_rows(ProbBidenWin.PA)
}
ProbBidenWin.PA %>%
summarize(LCI1 = quantile(BidenProbWin1,.025),
ProbWin1 = mean(BidenProbWin1),
UCI1 = quantile(BidenProbWin1,.975),
LCI3 = quantile(BidenProbWin3,.025),
ProbWin3 = mean(BidenProbWin3),
UCI3 = quantile(BidenProbWin3,.975))
## # A tibble: 1 × 6
## LCI1 ProbWin1 UCI1 LCI3 ProbWin3 UCI3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.867 0.916 0.958 0.526 0.529 0.532
Quick Exercise Can you do this for Wisconsin? Take it from start to finish using code from above.
# INSERT CODE HERE
B
times and go…ElectoralCollegeVotes <- NULL
B <- 100
Number.Polls <- nrow(Pres2020.StatePolls)
for(i in 1:B){
dat <- sample_n(Pres2020.StatePolls,
Number.Polls,
replace=TRUE)
ElectoralCollegeVotes <- dat %>%
group_by(StateName) %>%
summarize(BidenProbWin1 = mean(Biden > Trump),
BidenProbWin2 = mean(Biden),
BidenProbWin3 = mean(BidenNorm),
EV = mean(EV)) %>%
mutate(BidenECVExpected1 = BidenProbWin1*EV,
TrumpECVExpected1 = EV - BidenECVExpected1,
BidenECVExpected3 = BidenProbWin3*EV,
TrumpECVExpected3 = EV - BidenECVExpected3) %>%
summarize(BidenECV1 = sum(BidenECVExpected1),
TrumpECV1 = sum(TrumpECVExpected1),
BidenECV3 = sum(BidenECVExpected3),
TrumpECV3 = sum(TrumpECVExpected3)) %>%
bind_rows(ElectoralCollegeVotes)
}
The wonderful thing about using the empirical bootstrap is that we can compute any probability of interest.
ElectoralCollegeVotes %>%
summarize(BidenWin1 = mean(BidenECV1 > 270),
BidenWin3 = mean(BidenECV3 > 270))
## # A tibble: 1 × 2
## BidenWin1 BidenWin3
## <dbl> <dbl>
## 1 1 1
Quick Exercise What is the probability that Biden is predicted to get more than 300 electoral college votes from the states that have conducted a poll?
ElectoralCollegeVotes %>%
summarize(BidenWin1 = mean(BidenECV1 > 300),
BidenWin3 = mean(BidenECV3 > 300))
## # A tibble: 1 × 2
## BidenWin1 BidenWin3
## <dbl> <dbl>
## 1 1 0
Perhaps it makes more sense to visualize this. What type of data is this? Categorical or continuous? What graph should we use as a result? Let’s check what the variable looks like:
ElectoralCollegeVotes %>%
select(BidenECV1) %>%
summary()
## BidenECV1
## Min. :337.9
## 1st Qu.:343.0
## Median :344.9
## Mean :345.0
## 3rd Qu.:346.8
## Max. :351.2
Since continuous, let’s try a histogram with 40 bins…
ggplot(ElectoralCollegeVotes) +
geom_histogram(aes(x=BidenECV1), fill= "BLUE",bins=40)+
geom_histogram(aes(x=TrumpECV1), fill= "RED",bins=40) +
geom_histogram(aes(x=BidenECV3), fill= "light blue",bins=40)+
geom_histogram(aes(x=TrumpECV3), fill= "pink",bins=40) +
geom_vline(xintercept=301) + # plot actual outcome
geom_vline(xintercept=232) + # plot actual outcome
labs(x="Electoral College Vote Estimates",
y="Number of Simulations")
If we want to round the values to the nearest integer (i.e., no decimals) we can use a barplot to plot every unique value after rounding! we get…
ggplot(ElectoralCollegeVotes) +
geom_bar(aes(x=round(BidenECV1,digits=0)), fill= "BLUE") +
geom_bar(aes(x=round(TrumpECV1,digits=0)), fill= "RED") +
geom_bar(aes(x=round(BidenECV3,digits=0)), fill= "light blue")+
geom_bar(aes(x=round(TrumpECV3,digits=0)), fill= "pink") +
geom_vline(xintercept=301) +
geom_vline(xintercept=232) +
labs(x="Electoral College Vote Estimates",
y="Number of Simulations")